An algorithm for local geoparsing of microtext

Authors

  • Judith Gelernter
  • Shilpa Balaji
Abstract

The location of the author of a social media message is not always the same as the location the author writes about in the message. In applications that mine these messages for information, such as tracking news and political events or responding to disasters, it is the geographic content of the message rather than the location of the author that matters. To this end, we present a method to geo-parse the short, informal messages known as microtext. Our preliminary investigation has shown that many microtext messages contain place references that are abbreviated, misspelled, or highly localized. Standard geo-parsers miss these references; our geo-parser is built to find them. It uses Natural Language Processing methods to identify four kinds of reference: streets and addresses, buildings and urban spaces, toponyms, and place acronyms and abbreviations. It combines heuristics, open-source Named Entity Recognition software, and machine learning techniques. Our primary data consisted of Twitter messages sent immediately following the February 2011 earthquake in Christchurch, New Zealand. On this sample, the algorithm identified locations with an F statistic of 0.85 for streets, 0.86 for buildings, 0.96 for toponyms, and 0.88 for place abbreviations, with a combined average F of 0.90 for identifying places. The same data run through a standard geo-parser, Yahoo! Placemaker, yielded an F statistic of zero for streets and buildings (because Placemaker is designed to find neither) and an F of 0.67 for toponyms.
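The abstract does not spell out the heuristics, so the following Python is only a rough sketch of what a rule-based pass for street references and place abbreviations might look like, together with a balanced F-measure for scoring it. The suffix list, the abbreviation table, and the function names `find_streets`, `find_abbreviations`, and `f_measure` are all illustrative assumptions, not the authors' rules, gazetteer, or evaluation code, and the sketch assumes the reported F statistic is the usual harmonic mean of precision and recall.

```python
import re

# Illustrative pattern and abbreviation table; both are assumptions,
# not the rules or gazetteer used in the paper.
STREET_RE = re.compile(
    r"\b(?:\d+[A-Za-z]?\s+)?"    # optional house number such as "123" or "12a"
    r"(?:[A-Z][a-z]+\s+){1,3}"   # one to three capitalized name words
    r"(?i:street|st|road|rd|avenue|ave|lane|ln|drive|dr)\b"  # street type, any case
)
PLACE_ABBREVIATIONS = {"chch": "Christchurch", "nz": "New Zealand"}


def find_streets(message):
    """Return street-like spans matched by the regular expression."""
    return STREET_RE.findall(message)


def find_abbreviations(message):
    """Return (token, expansion) pairs for tokens found in the abbreviation table."""
    tokens = re.findall(r"[A-Za-z]+", message)
    return [
        (token, PLACE_ABBREVIATIONS[token.lower()])
        for token in tokens
        if token.lower() in PLACE_ABBREVIATIONS
    ]


def f_measure(predicted, gold):
    """Balanced F-measure (harmonic mean of precision and recall) over two sets of spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A pattern like this only catches capitalized street names, so lowercase or misspelled references of the kind the abstract highlights would still need the Named Entity Recognition and machine learning components it mentions.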

Similar resources

Clustering Microtext Streams for Event Identification

The popularity of microblogging systems has resulted in a new form of Web data – microtext – which is very different from conventional well-written text. Microtext often has the characteristics of informality, brevity, and varied grammar, which poses new challenges in applying traditional clustering algorithms to analyze microtext. In this paper, we propose a novel two-phase approach for cluste...

Dynamic Microcluster Chains in Microtext

Two features of microtext that challenge language processing tools are addressed in the context of linking messages in the emergency response domain. First, the effect of very short texts on several classifiers is estimated by comparing the results when classifiers are applied to the full text of news reports vs. only the headlines. These experiments demonstrate a decrease of 5-20% in accuracy....

Facial expression recognition based on Local Binary Patterns

Classical LBP has drawbacks, such as complexity and high-dimensional feature vectors, that make it necessary to apply dimension reduction. In this paper, we introduce an improved LBP algorithm that addresses these problems by using the Fast PCA algorithm to reduce the dimensionality of the extracted feature vectors. In other words, the proposed method (Fast PCA+LBP) is an improved LBP algorithm that is extracted ...

A Hybrid Algorithm using Firefly, Genetic, and Local Search Algorithms

In this paper, a hybrid multi-objective algorithm consisting of features of genetic and firefly algorithms is presented. The algorithm starts with a set of fireflies (particles) that are randomly distributed in the solution space; these particles converge to the optimal solution of the problem during the evolutionary stages. Then, a local search plan is presented and implemented for searching s...

Normalizing Microtext

The use of computer mediated communication has resulted in a new form of written text—Microtext—which is very different from well-written text. Tweets and SMS messages, which have limited length and may contain misspellings, slang, or abbreviations, are two typical examples of microtext. Microtext poses new challenges to standard natural language processing tools which are usually designed for ...

Journal title:
  • GeoInformatica

Volume 17  Issue 

Pages  -

Publication date 2013